This 'Introduction to R for Data Science' course is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This course, CSB1020, was developed based on feedback about the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
This lesson is the first in a 7-part series. At the end of the series, you will be able to import and format your data, make exploratory plots, perform some statistical tests, and make publication-quality plots and documents to share your results.
At the end of this session you will be familiar with the Jupyter Notebook environment and the R-kernel associated with it. You will know about basic data structures in R and how to create them. You will be able to install and load packages.
Use: these concepts are necessary for coding best practices and to understand your data types for analysis.
In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy our data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which really can save you time and help scale up your analyses.
The structure of the class is a code-along style: it is fully hands-on. Prior to each lecture, the materials will be available for download at GitHub, so you can spend more time coding than taking notes.
Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this link
Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need your UTORid credentials to complete the login process.
From there you will find each week's lecture files in the directory /2021-01-IntroR/Lecture_XX
You can open and download the Jupyter Notebook (.ipynb) files to your personal computer if you would like to run independently of the JupyterHub.
A live lecture version will be available here that will update as the lecture progresses. Be sure to refresh to take a look when you get lost!
Your work with Jupyter Notebooks on the University of Toronto JupyterHub will all be contained within a new browser tab, with the address bar showing something like
https://jupyter.utoronto.ca/user/assigned-username-hexadecimal/tree/2021-01-IntroR
All of this is running non-locally on a University of Toronto server rather than your own machine.
You'll see a directory structure from your home folder:
i.e. /2021-01-IntroR/, with a Lecture_01 folder within. Clicking on that, you'll find Lecture_01.skeleton.ipynb, which is the notebook we will use for today's little code-along.
We've implemented the class this way to reduce the burden of having to install various programs. While installation can be a little tricky, it's really not that bad. For this introduction course, however, you don't need to go through all of that just to learn the basics of coding.
Jupyter Notebooks also give us the option of inserting "markdown" text, much like what you're reading at this very moment. So we can intersperse ideas and information between our learning code blocks.
There is, however, an appendix section at the end of this lecture detailing how to install Jupyter Notebooks (and the R-kernel for them), as well as independent installation of the R-kernel itself and a great integrated development environment (IDE) called RStudio.
R is a language and an environment because it has the tools and software for the storage, manipulation, statistical analysis, and graphical display of data. It comes with about 25 built-in 'packages' and implements a simple programming language (derived from the S language). The core information and programming that makes up R is called the kernel. We may refer to this concept interchangeably as the R-kernel or r-base. A useful resource is the "Introduction to R" manual found on CRAN.
The R-kernel interprets the human-readable code we write (syntax) and performs the operations behind the scenes. By combining the basic functions provided by the R-kernel, we can create more complex actions, culminating in output ranging from mathematical analyses to beautiful data graphs.
So... what is in these packages? A package can be a collection of functions, data sets, and documentation.
Functions are the basic workhorses of R; they are the tools we use to analyze our data. Each function can be thought of as a unit that has a specific task. A function takes an input, evaluates it using an expression (e.g. a calculation, plot, merge, etc.), and returns an output (a single value, multiple values, a graphic, etc.).
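To make this input/expression/output anatomy concrete, here is a minimal sketch (the function name add_two is made up for illustration):

```r
# A hypothetical function illustrating input -> expression -> output
add_two <- function(x) {
  result <- x + 2   # the expression: a simple calculation
  return(result)    # the output: a single value
}

add_two(5)          # input: 5; output: 7
```

Every function you will meet in this course, from sqrt() to plot(), follows this same pattern, just with more elaborate expressions inside.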
In this course we will rely heavily on a package called tidyverse, which in turn depends on a series of other packages.
Users are encouraged to make their own packages. There are now well over 13,000 packages in R repositories (banks of packages), including more than 12,000 on CRAN (the Comprehensive R Archive Network) and about 1,500 on Bioconductor.
The "Comprehensive R Archive Network" (CRAN) is a collection of sites that have the same R and related R material:
Different sites (for example, we used http://cran.utstat.utoronto.ca/) are called mirrors because they reflect the content of the master site in Austria. There are mirrors worldwide to reduce the load on any one server. CRAN will be referred to here as the main repository for obtaining R packages.
Bioconductor is another repository for R packages, but it specializes in tools for high-throughput genomics data. One nice thing about Bioconductor is that it has decent vignettes. A vignette is the set of documentation for a package, explaining its functions and usages in a tutorial-like format.
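To see which vignettes you already have, base R's vignette() function can list and open them. A small sketch using the grid package, which ships with R:

```r
# List the vignettes that ship with a package (grid comes with base R)
grid_vignettes <- vignette(package = "grid")

# The results matrix has one row per vignette: Package, LibPath, Item, Title
grid_vignettes$results[, c("Item", "Title")]

# To read one in the help viewer, call vignette() with its Item name, e.g.:
# vignette("grid", package = "grid")
```

Bioconductor packages work the same way once installed; their vignettes are also browsable on the Bioconductor website.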
Behind the scenes of each Jupyter notebook, a programming kernel is running. For instance, our notebooks run an R kernel that interprets each code cell as R code.
As we move from code cell to code cell, all of the objects we have created are stored in memory. We can refer to them as we run the code and move forward, but if you overwrite or change them by mistake, you may have to rerun multiple cells!
There are some options in the "Cell" menu that can alleviate these problems such as "Run All Above". If you think you've made a big error by overwriting a key object, you can use that option to "re-initialize" all of your previous code!
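If you want to check which objects are currently sitting in memory, or clear them out for a truly fresh start, base R's ls() and rm() do the job:

```r
x <- 1
y <- "hello"

ls()             # lists the names of all objects in the current environment

rm(y)            # remove a single object
rm(list = ls())  # remove everything -- a clean slate before re-running cells
ls()             # character(0): the environment is now empty
```

Combining rm(list = ls()) with "Run All Above" gives you a reproducible re-initialization of your notebook state.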
Remember these friendly keys/shortcuts:
Esc to enter "Command Mode", which basically takes you outside of the cell
Enter to edit a cell
Arrow keys to navigate up and down (and within a cell)
Ctrl+Enter to run a cell (both code and markdown)
Shift+Enter to run the current cell and move to the next one below
Ctrl+/ to quickly comment and uncomment single or multiple lines of code

In Command Mode:

a inserts a new cell above the currently selected cell
b inserts a new cell below the currently selected cell (note that new cells default to code cells)
m converts a code cell to a markdown cell
y converts a markdown cell to a code cell

Depending on your needs, you may find yourself doing the following:
Jupyter allows you to alternate between "markdown" notes and "code" that can be run or re-run on the fly.
Each data run and its results can be saved individually as a new notebook, to compare data and small changes to analyses!
A flagship IDE for R is RStudio. It runs the R-kernel but offers additional tools and interfaces that allow the user and programmer to see and understand their code much better than just R by itself.
RStudio simplifies some basic tasks like
"What if I'm doing more than just running data through packages?"
As a development environment, RStudio offers features like debugging and access to environment variable states. It is a fully integrated development environment that makes it easy to look up help on packages and functions, save data states to come back to later, and work on multiple scripts that reference each other. It also has a decent user interface that can make looking at certain objects, like tables, much easier.
I suggest you try out both! Find what's comfortable for you and experiment with whatever works best for your needs!
Personally, I use R/RStudio to generate code, but after building this class as a Jupyter Notebook, I find notebooks are a really good tool for running smaller code snippets, especially in the context of working or talking with supervisors and collaborators. Many times, they may want to know something like
You can make quick changes on the fly and see the results there in the notebook without pulling up extra windows or programs. New runs can be saved in different versions of the notebook with quick footnotes on what has changed.
Again, consider it on a case-by-case basis...
Let's discuss some important behaviours before we begin coding:
Why bother?
Your worst collaborator is potentially you in 6 days or 6 months. Do you remember what you had for breakfast last Tuesday?
You can annotate your code for selfish reasons, or altruistic reasons, but annotate your code.
How do I start?
It is, in general, part of best coding practices to keep things tidy and organized.
A hash-tag # will comment your text. Inside a code cell in a Jupyter Notebook or anywhere in an R script, all text after a hashtag will be ignored by R and by many other programming languages. It's very useful to add comments about changes in your code, as well as detailed explanations about your scripts.
Put a description of what you are doing near your code at every process, decision point, or non-default argument in a function. For example, why you selected k=6 for an analysis, or the Spearman over Pearson option for your correlation matrix, or quantile over median normalization, or why you made the decision to filter out certain samples.
Break your code into sections to make it readable. As you learn Control Flow (Lecture 07) you will realize scripts are just a series of steps and major steps should be titled/outlined with your reasoning - much like when presenting your research.
Give your objects informative object names that are not the same as function names.
Comments may/should appear in three places:
# At the beginning of the script, describing the purpose of your script and what you are trying to solve
bedmasAnswer <- 5 + 4 * 6 - 0 #In line: Describing a part of your code that is not obvious what it is for.
Maintaining well-documented code is also good for mental health!
Keyboard shortcuts in RStudio:
Comment/uncomment lines: Ctrl + Shift + C (Windows, Linux) / Cmd + Shift + C (Mac)
Reflow comment: Ctrl + Shift + / (Windows, Linux) / Cmd + Shift + / (Mac)

Basically, you have the following options:
The most important aspects of naming conventions are being concise and consistent!
Start each script with a description of what it does.
Then load all required packages.
Consider what working directory you are in when sourcing a script.
Use comments to mark off sections of code.
Put function definitions at the top of your file, or in a separate file if there are many.
Name and style code consistently.
Break code into small, discrete pieces.
Factor out common operations rather than repeating them.
Keep all of the source files for a project in one directory and use relative paths to access them.
Keep track of the memory used by your program.
Always start with a clean environment instead of saving the workspace.
Keep track of session information in your project folder.
Have someone else review your code.
Use version control.
For more information on best coding practices, please visit swcarpentry
We all run into problems. We'll see a lot of mistakes happen in class too! That's OK if we can learn from our errors and quickly (or eventually) recover.
Use getwd() to check where you are working, list.files() or the Files pane to check that your file exists there, and setwd() to change your directory if necessary. Preferably, work inside an R project with all project-related files in that same folder. Your working directory will be set automatically when you open the project (this can be done by using File -> New Project and following the prompts).
Use typeof() and class() to check what type of data you have. Use str() to peek at your data structures if you're making assumptions about them.
Use help("function"), ?function (using the name of the function that you want to check), or help(package = "package_name"), or search the Help tab (which is also searchable).
Load packages with library("package_name"). If you only need one function from a package, or need to specify which package a function belongs to because functions with the same name exist in different packages, you can use a double colon, i.e. package_name::function_name.
A session aborted message can happen for a variety of reasons, like not having enough computational power to perform a task, or because of a system-wide failure. You will need to rerun your previous cells!
A missing , or an extra ) still happens to me too!
At this level, many people have had and solved your problem. Beginners get frustrated because they get stuck and take hours to solve a problem themselves. Set your limit, stay within it, then go online and get help.
Remember: Everyone looks for help online ALL THE TIME. It is very common. Also, with programming there are multiple ways to come up with an answer, even different packages that let you do the same thing in different ways. You will work on refining these aspects of your code as you go along in this course and in your coding career.
Last but not least, to make life easier: Under the Help pane, there is a Cheatsheet of Keyboard Shortcuts or a browser list here.
This is an image of a possible directory.
In this hierarchy we will pretend to be benedict, and we are hanging out in our Tables folder. R looks to read in your files from your working directory, which in this case would be Tables. At this moment, R would have access to proof.tsv and genes.csv. If I tried to open paper.txt under benedict, R would tell me there is no such file in my current working directory.
To get your working directory in R you would type in your code cell:
getwd()
You would then press Ctrl+Enter (Windows, Linux) or Cmd+Enter (Mac) to execute the code in the cell.
The output below your code cell would be:
'/home/benedict/Tables'
R will always tell you your absolute directory. An absolute directory is a path starting from your root "/". The absolute directory can vary from computer to computer. My home directory and your home directory are not the same; our names differ in the path.
To move directories, it is good to know a couple shortcuts. '.' is your current directory. '..' is up one directory level. '~' is your home directory (a shortcut for "/home/benedict"). Therefore, our current location could also be denoted as "~/Tables".
To move to the directory ewan we use a function that will set the working directory:
setwd("/home/ewan")
or
setwd("../../ewan") # relative path: up two levels from Tables, then into ewan
A relative directory is a path starting from wherever you currently are (AKA your working directory). This path could be the same on your computer and my computer if and only if we have the same directory structure.
If I wanted to move back to Tables using the absolute path, I would set a new working directory:
setwd("/home/benedict/Tables")
or
setwd("~/Tables") # remember, ~ is already /home/benedict
And the relative path would be:
setwd("../benedict/Tables")
There is some debate over setting working directories within scripts. Obviously, not everyone has the same absolute path, so if you must set a directory in your script, it is best to use a relative path starting from the folder your script is in. Keep in mind that others you share your script with might not have the same directory structure if you refer to sub-directories.
You can set your working directory by:
setwd()

In RStudio you may also use:

Session -> Set Working Directory (3 Options)
Files Window -> More (Gear Symbol) -> Set As Working Directory

# addition
1+2
# exponents
20^(1/2)
# basic math functions
sqrt(20)
# advanced math functions
factorial(3)
# access to constants
pi
exp(1)
curve(10*x^2)
curve(10*x^2, xlim=c(-2,2))
# Problem: these will all run at once, and only the last help page will be made visible
?help
?str
# ?c
# ?seq
# ?setwd
# ?sort
# ?dir
# ?head
# ?names
# ?summary
# ?dim
# ?range
# ?max
# ?min
# ?sum
# ?pairs
# ?plot
Up until now we've simply been calculating with R and the output appears after the code cell. There is nothing left behind in the R interpreter or memory. If we wanted to hold onto a number or calculation we would need to assign it to a named variable. In fact R has multiple methods for assigning a value to a variable and an order of precedence!
-> and ->> Rightward assignment: we won't really be using this in our course.
<- and <<- Leftward assignment: <- is the assignment operator used by most 'authentic' R programmers (partly a historical throwback); <<- assigns into an enclosing environment and is best saved for later.
= Leftward assignment: the token commonly used for assignment in many other programming languages, but in R it carries a dual meaning (it also binds arguments inside function calls)!
Notes
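One note worth expanding: the dual meaning of = is that inside a function call it binds arguments without creating variables, while at the top level it assigns. A minimal sketch, assuming no object named x exists yet:

```r
# Inside a call, `=` binds an argument -- it does NOT create a variable
mean(x = c(1, 2, 3))    # x here is just the name of mean()'s argument
exists("x")             # FALSE in a fresh session: no object x was made

# Using <- inside a call performs a real assignment as a side effect
mean(x <- c(1, 2, 3))   # same result as above, but...
exists("x")             # TRUE: x now exists in your workspace
```

This is one reason many R style guides recommend <- for assignment and reserving = for function arguments.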
Let's try some exercises.
# Assign with the standard =
a=4
a
print(a)
[1] 4
#Left hand assigner (original way to assign results)
# sometimes the <- is necessary for functions originally created in S. often seen on R help forums if you Google future issues
a<-3
a
#Right hand assigner
3 -> d
d
a=4
b=2
a*b
# each statement after a semicolon is interpreted as if it were on its own line
a=4;b=2;a*b
Spaces separate commands and variables, but the total number of spaces is irrelevant to the interpreter when it runs your code.
Let's see it in action
b=2
a=3;b*a
#Vs
a = 3; b* a
A more complex example
Using spaces to organize the above code:
Under some special circumstances, spaces are required, e.g. when using function paste() and its argument sep = ' '.
paste("Can", "I", "go", "out", "now", "?", sep='')
paste("Can", "I", "go", "out", "now", "?", sep=' ')
paste("Can", "I", "go", "out", "now", "?", sep=' ')
paste("Can", "I", "go", "out", "now", "?", sep='_')
paste("Can", "I", "go", "out", "now", "?", sep='hi')
R calculates the right side of the assignment first; the result is then applied to the left. This is a common paradigm in programming that simplifies variable behaviour for counting and tracking results as they build up over time.
This also allows us to increment variables or manipulate objects to update them!
i = 1
i <- i + 1
i <- i + 1
i <- i + 1
i <- i + 1
i = i + 1
i
This behaviour can be extended in a more complex fashion to encompass multiple variables
Remember: variables are specific identifiers/placeholders that allow us to access our data. We can change the nature and size of that data simply by reassigning the value of variable.
# Add multiple things
result <- 5 * a + 2 * b
result
# make a calculation
result ^ pi
# Use and overwrite the current value of "result"
result <- result ^ pi
# this PERMANENTLY overwrote your old 'result' object. Remember not to SCREW yourself
result
# Don't forget that variable names ARE case-sensitive
a = 5
A = 7
b = 3
B = 15
b; B; a; A; a + A;
What do I mean by 'types' of data?
The job of data structures is to "host" the different data types. There are five main types of data structures in R: vectors, matrices, data frames, arrays, and lists.
One single value from any of the above data types. It is the smallest possible "unit" of data within R.
X <- 5
b = 10
this.word = "characterInformation"
X
b
this.word
There is a numerical order to a vector, much like a queue AND you can access each element (piece of data) individually or in groups.
Here are what vectors of each data 'type' would look like. Note that character items must be in quotations. 'L' is placed next to a number to specify an integer.
# character vectors
character_vector <- c('bacteria', "virus", 'archaea')
character_vector
# numeric vectors
numeric_vector <- c(1:10)
numeric_vector
# logical vectors
logical_vector <- c(TRUE, FALSE, TRUE) # TRUE and FALSE are also known as "boolean" values
logical_vector
# Integer vectors
integer_vector <- c(1L, 8L)
# The "L" makes the numbers integers. Can be used to get your code to run faster and consume less memory.
# A double ("numeric") vector uses 8 bytes per element. An integer vector uses only 4 bytes per element
integer_vector
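You can check the memory claim above yourself with object.size() (exact byte counts vary slightly by R version):

```r
# Compare the memory footprint of a million integers vs a million doubles
int_vec <- rep(1L, 1e6)   # integer vector: ~4 bytes per element
dbl_vec <- rep(1, 1e6)    # double ("numeric") vector: ~8 bytes per element

object.size(int_vec)      # roughly 4 MB
object.size(dbl_vec)      # roughly 8 MB
```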
# What happens if we try to include more than one type of data?
mixed_vector <- c("bacteria", 1, TRUE, NA)
mixed_vector
# Let's look at the structure of our vector
str(mixed_vector)
chr [1:4] "bacteria" "1" "TRUE" NA
R will coerce (force) your vector to be of one data type; in this case the most inclusive type is character. When we explicitly force a change from one data type to another, it is known as casting.
# Let's convert our mixed_vector
into_numeric <- as.numeric(mixed_vector); into_numeric

Warning message in eval(expr, envir, enclos): "NAs introduced by coercion"

# What about our logical_vector?
as.numeric(logical_vector)
Let's highlight the above warning for a couple of reasons:
Keep your data types in mind. It is good practice to look at your object or the global environment to make sure the object that you just made is what you think it is.
It can be useful for data analysis to be able to switch from TRUE/FALSE to 1/0, and it is pretty easy, as we have just seen.
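For example, since TRUE coerces to 1 and FALSE to 0, arithmetic on a logical vector is a quick way to count or compute proportions (the vector name survived is made up for illustration):

```r
survived <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

as.numeric(survived)   # 1 0 1 1 0
sum(survived)          # 3 -- how many TRUEs
mean(survived)         # 0.6 -- the proportion of TRUEs
```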
You can name the contents of your vectors or specify them upon vector creation.
names(logical_vector) <- c("male", "elderly", "heart attack")
logical_vector
#is equivalent to
logical_vector <- c("male" = TRUE, "elderly"=FALSE, "heart attack"=TRUE)
logical_vector
# or using a ;
logical_vector <- c("male" = TRUE, "elderly"=FALSE, "heart attack"=TRUE); logical_vector
# The number of elements in a vector is its length.
length(character_vector)
length(numeric_vector)
length(logical_vector)
character_vector
# You can grab a specific element by its index, or by its name.
character_vector[3]
#second and third element in the vector inclusive (varies across programming languages)
character_vector[2:3]
#you can use negative indexing to select 'everything but'
character_vector[-2]
character_vector[-c(2,3)]
# You can grab elements by their assigned "names"
logical_vector["male"]
Matrices can be accessed similarly to vectors, but in a [row, column] format.

In R, functions within functions are read inside-out, i.e. starting from the innermost parentheses and moving outwards:
matrix(c(rep(0, 10), rep(1,10)), nrow = 5, ncol = 5)
Here the two rep(...) functions will be evaluated before evaluating matrix(...)
c(rep(0, 10), rep(1,10))
my_matrix <- matrix(c(rep(0, 10), rep(1,10)), nrow = 5, ncol = 5)
my_matrix
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
# Is equivalent to
c(rep(0:1, each = 10, times =2))
my_matrix <- matrix(c(rep(0:1, each = 10, times =2)), nrow = 5, ncol = 5)
my_matrix
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
What has happened here? Look up the rep() function. Why has R not thrown an error? How would I make this same matrix without vector recycling? Can you think of 2 ways?
# Make a matrix by recycling within rep() but not within the matrix() call
my_matrix <- matrix(c(...), nrow = 5, ncol = 5); my_matrix; print("version 1")
# No recycling in generating the initial vector
my_matrix <- matrix(c(...), nrow = 5, ncol = 5); my_matrix; print("version 2")
# Just write out the entire vector you want to convert to matrix
my_matrix <- matrix(c(...), nrow = 5, ncol = 5); my_matrix; print("version 3")
# We can replicate through coercion!
my_matrix <- matrix(as.numeric(c(...)), nrow = 5, ncol = 5); my_matrix; print("version 4")
#This may be a good time to mention that TRUE and FALSE can be abbreviated to T and F
my_matrix <- matrix(as.numeric(c(...)), nrow = 5, ncol = 5); my_matrix; print("version 5")
Notice how the matrices are "filled" one column at a time?
Challenge: What do you think this matrix will look like?
my_matrix <- matrix(c(0,0,0,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0), nrow = 5, ncol = 5)
my_matrix <- matrix(c(0,0,0,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0), nrow = 5, ncol = 5)
my_matrix
| 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 |
matrix(c(rep(0,10), rep(1,10), nrow = 5, ncol = 5))

(The result is a 22-row, single-column matrix: ten 0s, ten 1s, then 5 and 5.)

What happened above?

R code is evaluated inside-out, but the brackets here are poorly positioned: the closing parenthesis of c() was placed after nrow and ncol, so they became elements of the data vector instead of arguments to matrix(). You end up with a single-column matrix of numbers equivalent to c(rep(0,10), rep(1,10), 5, 5).
Remember to be mindful of your bracket placement or you'll be in for some headaches!
Challenge
Make a 4 x 4 matrix that looks like this, using the seq() function at least once.
 2  4  6  8
10 12  3  6
 9 12  0  1
 0  1  0  1
matrix(c(seq(...), seq(...), rep(...)), nrow = 4, ncol = 4, byrow = TRUE)
#etc
# A matrix is a 2D object. We can now check out a couple more properties - like the number of rows and columns.
my_matrix
print("structure")
str(my_matrix)
print("rows")
nrow(my_matrix)
print("columns")
ncol(my_matrix)
print("dimensions")
# reported as rows vs columns
dim(my_matrix)
print("length")
length(my_matrix)
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
[1] "structure"
 num [1:5, 1:5] 0 0 0 0 0 0 0 0 0 0 ...
[1] "rows"
[1] 5
[1] "columns"
[1] 5
[1] "dimensions"
[1] 5 5
[1] "length"
[1] 25
#To access a specific row or column we can still use indexing.
my_matrix[3:5,]
my_matrix[, 4]
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 |
# Note that when we are sub-setting a single row or column, we end up with a vector.
is.vector(my_matrix[,4])
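A related tip: if you want a subset to stay a matrix rather than being simplified to a vector, base R's drop = FALSE argument does exactly that. A small sketch with a fresh 5 x 5 matrix:

```r
m <- matrix(0, nrow = 5, ncol = 5)

is.vector(m[, 4])                  # TRUE: the dimension was dropped
is.matrix(m[, 4, drop = FALSE])    # TRUE: still a 5 x 1 matrix
dim(m[, 4, drop = FALSE])          # 5 1
```

This matters in scripts where later code expects a matrix and would break on a plain vector.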
# It is common to transpose matrices. Note that the set of ones will now be in rows rather than columns.
t(my_matrix)
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 |
| 0 | 0 | 0 | 0 | 0 |
Now that we have had the opportunity to create a few different objects, let's talk about what an object class is. An object class can be thought of as how an object will behave in a function; because of this, the same function can do different things depending on the class of object it is given.
class(character_vector)
class(numeric_vector)
class(logical_vector)
Some R package developers have created their own object classes. We won't deal with this today, but it is good to be aware of from a trouble-shooting perspective that your data may need to be formatted to fit a certain class of object when using different packages.
Whereas matrices are limited to a single specific type of data within each instance, data frames are like lists to the extent that they can hold different types of data. More specifically, each column must contain a single data type, and all columns must have the same length.
This object allows us to generate tables of mixed information much like an Excel spreadsheet.
# Let's make a data frame
my_data_frame <- data.frame(character = c('bacteria', 'virus', 'archaea'),
num = c(1:10),
log = c(TRUE, FALSE, TRUE))
## Oops! This will break the second rule of data frame club
Error in data.frame(character = c("bacteria", "virus", "archaea"), num = c(1:10), : arguments imply differing number of rows: 3, 10
Traceback:
1. data.frame(character = c("bacteria", "virus", "archaea"), num = c(1:10),
. log = c(TRUE, FALSE, TRUE))
2. stop(gettextf("arguments imply differing number of rows: %s",
. paste(unique(nrows), collapse = ", ")), domain = NA)
# Let's make a data frame correctly
my_data_frame <- data.frame(character = c('bacteria', 'virus', 'archaea'),
num = c(1:3),
log = c(TRUE, FALSE, TRUE))
my_data_frame
| character | num | log |
|---|---|---|
| <chr> | <int> | <lgl> |
| bacteria | 1 | TRUE |
| virus | 2 | FALSE |
| archaea | 3 | TRUE |
Many R packages have been made to work with data in data frames, and this is the class of object where we will spend most of our time.
Let's use some of the functions we have learned for finding out about the structure of our data frame.
# What is the structure of a data frame?
str(my_data_frame)
'data.frame': 3 obs. of 3 variables:
 $ character: chr "bacteria" "virus" "archaea"
 $ num : int 1 2 3
 $ log : logi TRUE FALSE TRUE
We can also convert between data types if they are similar enough. For example, I can convert my matrix into a data frame. Since a data frame can hold any type of data, it can hold all of the numeric data in a matrix.
# cast a matrix to a dataframe
my_matrix
new_data_frame <- as.data.frame(my_matrix)
new_data_frame
| 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| V1 | V2 | V3 | V4 | V5 |
|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 |
colnames()

Notice that after converting our matrix, the column names have been automatically assigned generic identifiers. Sometimes you may wish to rename these for whatever reason.
# Note that R just made up column names for us. We can provide our own vector of column names.
colnames(new_data_frame) <- c("col1", "col2", "col3", "col4", "col5")
new_data_frame
#equivalent to
colnames(new_data_frame) <- c(paste0(rep("col", 5), 1:5))
new_data_frame
| col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 |
Casting (like coercion) can only be accomplished if the objects or data types (within) are compatible. We can convert our new_data_frame to a matrix but what about my_data_frame?
# In contrast, our data frame with multiple data types cannot be converted into a matrix, as a matrix can only hold one data type. We could however, transform our new_data_frame back into a matrix. The matrix will retain our column heading.
new_matrix <- as.matrix(my_data_frame)
str(new_matrix)
new_matrix
 chr [1:3, 1:3] "bacteria" "virus" "archaea" "1" "2" "3" "TRUE" "FALSE" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "character" "num" "log"
| character | num | log |
|---|---|---|
| bacteria | 1 | TRUE |
| virus | 2 | FALSE |
| archaea | 3 | TRUE |
Notice that the numeric vector is now character!
nrow(new_data_frame) # retrieve the number of rows in a data frame
ncol(new_data_frame) # retrieve the number of columns in a data frame
new_data_frame$column_name # Access a specific column by its name
new_data_frame[x,y] # Access a specific element located at row x, column y

There are many more ways to access and manipulate data frames that we'll explore further down the road.
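Putting those accessors to work on the small data frame from earlier (rebuilt here so the cell runs on its own):

```r
# Rebuild the small data frame from above so this cell stands alone
my_data_frame <- data.frame(character = c("bacteria", "virus", "archaea"),
                            num = c(1:3),
                            log = c(TRUE, FALSE, TRUE))

nrow(my_data_frame)       # 3
ncol(my_data_frame)       # 3
my_data_frame$character   # the whole 'character' column
my_data_frame[2, "num"]   # 2 -- row 2, column 'num'
my_data_frame[3, 1]       # "archaea" -- row 3, column 1
```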
Arrays are n-dimensional objects that, like matrices, hold a single type of data (numeric, in this example). To create an array, we give a vector of data to fill the array, and then the dimensions of the array. This code will recycle the vector 1:10 to fill five slices that have 2 x 3 dimensions. To visualize the array, we will print it afterwards.
my_array <- array(data = 1:10, dim = c(2,3,5))
# Note that we need to print the array in order to make it more human-readable in Jupyter notebooks
print(my_array)
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 1
[2,] 8 10 2
, , 3
[,1] [,2] [,3]
[1,] 3 5 7
[2,] 4 6 8
, , 4
[,1] [,2] [,3]
[1,] 9 1 3
[2,] 10 2 4
, , 5
[,1] [,2] [,3]
[1,] 5 7 9
[2,] 6 8 10
You can access the elements within an array much like a vector, data frame, or list using the format [row, column, array_number]
# This arrangement makes it more clear how we would subset the number 7 out of array 5.
my_array[1,2,5]
# A 2D array is just a matrix. Unless you specify a 3rd dimension.
twoD_array <- array(data = 1:10, dim = c(2,3))
print(twoD_array)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
# almost equivalent to
twoD_array2 <- array(data = 1:10, dim = c(2,3,1))
print(twoD_array2)
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
# Check the difference between these two arrays
all.equal(twoD_array, twoD_array2)
class(twoD_array)
class(twoD_array2)
I personally can't recall ever using an array per se, but I have created array-like objects with lists! I wouldn't worry about it too much, but you may encounter these objects every once in a while. Speaking of lists...
Lists can hold mixed data types of different lengths. These are especially useful for bundling data of different types for passing around your scripts! Rather than having to call multiple variables by name, you can store them in a single list!
mixed_list <- list(character = c('bacteria', 'virus', 'archaea'),
num = c(1:10),
log = c(TRUE, FALSE, TRUE))
print(mixed_list)
$character
[1] "bacteria" "virus"    "archaea"

$num
 [1]  1  2  3  4  5  6  7  8  9 10

$log
[1]  TRUE FALSE  TRUE
#formatting - equivalent, but less reader friendly for longer lists
mixed_list = list(character = c('bacteria', 'virus', 'archaea'), num = c(1:10), log = c(TRUE, FALSE, TRUE))
print(mixed_list)
If you forget what is in your list, use the str() function to check out its structure. It will tell you the number of items in your list and their data types. You can (and should) call str() on any R object. You can also try it on one of our vectors.
str(mixed_list)
List of 3
 $ character: chr [1:3] "bacteria" "virus" "archaea"
 $ num      : int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ log      : logi [1:3] TRUE FALSE TRUE
Accessing lists is much like opening up a box of boxes of chocolates. You never know what you're gonna get when you forget the structure!
You can access elements with a mixture of number and naming annotations. Also [[x]] is meant to access the xth "element" of the list.
# To subset for 'virus', I first have to subset for the character element of the list. Kind of like a Russian nested doll or a present, where you have to open the outer layer to get to the next.
mixed_list[[1]]
mixed_list$character
mixed_list[[1]][2]
mixed_list$character[2]
Ah, the dreaded factors! A factor is a class of object used to encode a character vector into categories. They are used to store categorical variables and although it is tempting to think of them as character vectors this is a dangerous mistake (you will get scratched, badly!).
Factors make perfect sense if you are a statistician designing a programming language (!) but to everyone else they exist solely to torment us with confusing errors. A factor is really just an integer vector with an additional attribute, levels (accessed with levels()), which defines the possible values the factor can take.
crazy_factor = factor(c("up", "down", "down", "sideways", "up"))
crazy_factor
# or
print(crazy_factor)
# the print() function is needed inside iterative functions (e.g. looping) to actually print the output that is being generated
levels(crazy_factor)
as.integer(crazy_factor) #Notice the alphabetic rearrangement! It's important to keep this in mind when looping (week 7)
[1] up       down     down     sideways up
Levels: down sideways up
Why not just use character vectors, you ask?
Believe it or not factors do have some useful properties. For example, factors allow you to specify all possible values a variable may take even if those values are not in your data set. Think of conditional formatting in Excel.
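A quick sketch of that property with toy data: the category "sideways" never appears in the observations, yet the factor still carries it, and table() reports a zero count for it.

```r
# "sideways" is a declared possible value even though it never occurs
outcome <- factor(c("up", "down", "up"),
                  levels = c("up", "down", "sideways"))

levels(outcome)  # all three declared levels
table(outcome)   # "sideways" appears with a count of 0
```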
Since the inception of R, data.frame() calls have been used to create data frames, but the default behaviour was to convert character strings to factors! This is a throwback to the original purpose of R, which was to perform statistical analyses on data sets with methods like ANOVA (lecture 06!), which examine the relationships between variables (i.e. factors)!
As R has become more popular and its applications and packages have expanded, incoming users have been faced with remembering this obscure behaviour, leading to lost hours of debugging grief as they wondered why they couldn't pull information from their data frames to do a simple analysis on C. elegans strain abundance via molecular inversion probes in datasets of multiplexed populations. #SuspiciouslySpecific
That meant that users usually had to create data frames including the toggle
data.frame(name=character(), value=numeric(), stringsAsFactors = FALSE)
Fret no more! As of R 4.0.0 the default behaviour has switched and stringsAsFactors = FALSE is the default! Now if we want our characters to be factors, we must convert them explicitly, or turn this behaviour on when creating each data frame!
# Look at the data frame with and without the stringsAsFactors call
my_data_frame <- data.frame(character = c('bacteria', 'virus', 'archaea'),
num = c(1:3),
log = c(TRUE, FALSE, TRUE))
str(my_data_frame)
'data.frame':	3 obs. of  3 variables:
 $ character: chr  "bacteria" "virus" "archaea"
 $ num      : int  1 2 3
 $ log      : logi  TRUE FALSE TRUE
You can specify that any columns of strings be converted to factors or you can specify specific columns as factors.
# All character vectors become factors
my_data_frame <- data.frame(character = c('bacteria', 'virus', 'archaea'),
num = c(1:3),
log = c(TRUE, FALSE, TRUE), stringsAsFactors = TRUE)
str(my_data_frame)
'data.frame':	3 obs. of  3 variables:
 $ character: Factor w/ 3 levels "archaea","bacteria",..: 2 3 1
 $ num      : int  1 2 3
 $ log      : logi  TRUE FALSE TRUE
# Only columns you specify as factors will be factors
my_data_frame <- data.frame(character = as.factor(c('bacteria', 'virus', 'archaea')),
num = c(1:3),
log = c(TRUE, FALSE, TRUE))
str(my_data_frame)
'data.frame':	3 obs. of  3 variables:
 $ character: Factor w/ 3 levels "archaea","bacteria",..: 2 3 1
 $ num      : int  1 2 3
 $ log      : logi  TRUE FALSE TRUE
If we look at the structure again, we still have 3 levels. This is because each unique character element has been encoded as a number.
(Note that a column can be subset by index or by its name using the '$' operator.)
my_data_frame$character
#equivalent to
my_data_frame[, 1]
levels(my_data_frame$character)
Note that the first character object in the data frame is 'bacteria', however, the first factor level is archaea. R by default puts factor levels in alphabetical order. This can cause problems if we aren't aware of it.
Always check to make sure your factor levels are what you expect.
With factors, we can deal with our character levels directly, or their numeric equivalents. Factors are extremely useful for performing group calculations as we will see later in the course.
as.numeric(my_data_frame$character)
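As a small preview of those group calculations (a sketch with made-up weights and groups; tapply() itself is covered later in the course), a factor can split a numeric vector into groups:

```r
weights <- c(2.1, 3.5, 1.8, 4.0, 2.9, 3.3)
group <- factor(c("a", "b", "a", "b", "a", "b"))

# compute the mean weight within each factor level
group_means <- tapply(weights, group, mean)
group_means
```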
Look up the factor function. Use it to make 'bacteria' the first level, 'virus' the second level, and 'archaea' the third level for the data frame 'my_data_frame'. Bonus if you can make the level numbers match (1,2,3 instead of 2,3,1). Use functions from the lesson to make sure your answer is correct.
# Recreate my_data_frame
my_data_frame <- data.frame(character = c('bacteria', 'virus', 'archaea'),
num = c(1:3),
log = c(TRUE, FALSE, TRUE))
#this is okay - specify your levels explicitly rather than allowing it to choose by default
my_data_frame$character <- factor(my_data_frame$character,
levels = c('bacteria', 'virus', 'archaea'))
str(my_data_frame)
#archaea, bacteria, virus
#2,3,1
'data.frame':	3 obs. of  3 variables:
 $ character: Factor w/ 3 levels "bacteria","virus",..: 1 2 3
 $ num      : int  1 2 3
 $ log      : logi  TRUE FALSE TRUE
For certain reasons/models that we will absolutely not cover in this course, you can make your factors ordered which means that there is an order of precedence. This inherent informational order can be used to your advantage when working with data.
# Recreate my_data_frame
my_data_frame <- data.frame(character = c('bacteria', 'virus', 'archaea'),
num = c(1:3),
log = c(TRUE, FALSE, TRUE))
#this is okay - what we wanted, except it implies a < relationship
#class is actually 'ordered' 'factor'
#Ordered factors differ from factors only in their class, but methods and the model-fitting functions treat the two classes quite differently.
my_data_frame$character <- factor(my_data_frame$character,
levels = c('bacteria', 'virus', 'archaea'),
ordered = TRUE)
print(my_data_frame$character)
str(my_data_frame)
levels(my_data_frame$character)
#bacteria, virus, archaea
#1,2,3
[1] bacteria virus    archaea
Levels: bacteria < virus < archaea
'data.frame':	3 obs. of  3 variables:
 $ character: Ord.factor w/ 3 levels "bacteria"<"virus"<..: 1 2 3
 $ num      : int  1 2 3
 $ log      : logi  TRUE FALSE TRUE
Note that you can also label your factors when you make them. You need to be extremely careful with this. You may have good reasons to do this but remember that you are labeling the integer that is associated with the factor level after it has been converted!
# When labeling factors can go wrong
# Recreate my_data_frame
my_data_frame <- data.frame(character = c('bacteria', 'virus', 'archaea'),
num = c(1:3),
log = c(TRUE, FALSE, TRUE))
#factor() will decide the level order on its OWN before applying the given labels!
my_data_frame$character <- factor(my_data_frame$character,
labels = c('bacteria', 'virus', 'archaea'))
print(my_data_frame$character)
str(my_data_frame)
#bacteria, virus, archaea
#BUT named levels above, therefore
#bacteria = archaea, virus = bacteria, archaea = virus
#2,3,1
#BUT data frame changed to:
#virus, archaea, bacteria
[1] virus    archaea  bacteria
Levels: bacteria virus archaea
'data.frame':	3 obs. of  3 variables:
 $ character: Factor w/ 3 levels "bacteria","virus",..: 2 3 1
 $ num      : int  1 2 3
 $ log      : logi  TRUE FALSE TRUE
What just happened to our factor levels?
When we called factor(my_data_frame$character, labels = c('bacteria', 'virus', 'archaea')) there was an order of operations that occurred.
factor() was used to cast the vector c('bacteria', 'virus', 'archaea') into a factor, and the levels were assigned in alphabetical order: 'archaea', 'bacteria', 'virus'. Looking back at the order of our vector, that makes the integer codes (2, 3, 1). factor() then re-labelled those integer values with the supplied labels in level order: 1 = 'bacteria', 2 = 'virus', 3 = 'archaea'.
# Labeling factors correctly
# Recreate my_data_frame
my_data_frame <- data.frame(character = c('bacteria', 'virus', 'archaea'),
num = c(1:3),
log = c(TRUE, FALSE, TRUE))
# You need to supply factor() with both the levels and the labels if you want them to turn out the way you envision
my_data_frame$character <- factor(my_data_frame$character,
levels = c('bacteria', 'virus', 'archaea'), # explicitly order your levels!
labels = c('bacteria', 'virus', 'archaea')) # names the levels
#bacteria, virus, archaea
#1,2,3
print(my_data_frame$character)
str(my_data_frame)
[1] bacteria virus    archaea
Levels: bacteria virus archaea
'data.frame':	3 obs. of  3 variables:
 $ character: Factor w/ 3 levels "bacteria","virus",..: 1 2 3
 $ num      : int  1 2 3
 $ log      : logi  TRUE FALSE TRUE
For the most part, factors are important for various statistics involving categorical variables, as you'll see for things like linear models (lecture 06!). Love 'em or hate 'em, factors are integral to using R so better learn to live with them.
Yes, you can treat data frames and arrays like large lists where mathematical operations can be applied to individual elements or to entire columns or more!
First, let's take a look at our data frame
my_data_frame
# or outside of Jupyter use the View() function for a pop-up pane that looks a bit like an excel sheet
# (more familiar to our eyes)
print(my_data_frame)
| character | num | log |
|---|---|---|
| <fct> | <int> | <lgl> |
| bacteria | 1 | TRUE |
| virus | 2 | FALSE |
| archaea | 3 | TRUE |
  character num   log
1  bacteria   1  TRUE
2     virus   2 FALSE
3   archaea   3  TRUE
Therefore be careful to specify your numeric data for mathematical operations.
my_data_frame
my_data_frame
my_data_frame
# So you can do math... on an array.
print(my_array)
print(my_array)
print(my_array) # why do we have three indices for this array?
apply() function to perform actions across data structures
The above are illustrative examples to see how our different data structures behave. In reality, you will want to do calculations across rows and columns, and not on your entire matrix or data frame.
For example, we might have a count table where rows are genes, columns are samples, and we want to know the sum of all the counts for a gene. To do this, we can use the apply() function. apply() takes an array or matrix (or something that can be coerced to one, like a fully numeric data frame) and applies a function over rows (MARGIN = 1) or columns (MARGIN = 2). Here we invoke the sum function.
counts <- data.frame(Site1 = c(geneA = 2, geneB = 4, geneC = 12, geneD = 8),
Site2 = c(geneA = 15, geneB = 18, geneC = 27, geneD = 28),
Site3 = c(geneA = 10, geneB = 7, geneC = 13, geneD = 15))
counts
#?apply
# This won't work because x is lower case
#apply(x = counts, MARGIN = 1, FUN = sum)
print("apply() with sum")
print("apply() returns a numeric vector")
str(...)
class(...)
Note that the output is no longer a data frame. Since the result for each row is a single number, the results are instead returned as a named numeric vector.
apply() function will recognize basic functions.
# Using apply() across rows
apply(counts, MARGIN = 1, ...)
apply(counts, MARGIN = 1, ...)
apply(counts, MARGIN = 1, ...)
apply(counts, MARGIN = 1, ...)
When all data values are transformed, the output is a numeric matrix.
What if I want to know something else? We can create a function. The sum function we called before can also be written as a function taking in x (in this case the vector of values from our coerced data frame row by row) and summing them. Other functions can be passed to apply() in this way.
apply(counts, MARGIN = 1, sum)
#equivalent to
apply(counts, MARGIN = 1, ...)
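Any single function name can also be replaced with an anonymous function written in place. Here is a sketch using the counts table from above (redefined so the block stands alone):

```r
counts <- data.frame(Site1 = c(geneA = 2, geneB = 4, geneC = 12, geneD = 8),
                     Site2 = c(geneA = 15, geneB = 18, geneC = 27, geneD = 28),
                     Site3 = c(geneA = 10, geneB = 7, geneC = 13, geneD = 15))

# sum of counts per gene, written as an anonymous function
gene_sums <- apply(counts, MARGIN = 1, FUN = function(x) sum(x))
gene_sums

# range (max - min) of counts per gene
gene_ranges <- apply(counts, MARGIN = 1, FUN = function(x) max(x) - min(x))
gene_ranges
```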
Use the apply() function to multiply the counts for each gene by 3.
apply(counts,
MARGIN = 1,
...)
Missing values in R are represented as NA (Not Available). Impossible values (like the result of dividing zero by zero) are represented by NaN (Not a Number). Both can be considered null values. Both, especially NAs, need to be handled explicitly, otherwise they may cause errors or misleading results in your functions.
# Set up some vectors for an array
brand <- c("wildrose", "guinness", "grasshopper")
wheat.type <- c("Hard Red Spring", "Hard Red Winter", "Soft Red Winter" )
rating <- c(...)
# Put it all together with cbind (you can use this for data frames too!)
NA.example <-
# Use the mean() function and see what happens with NA values
(rating) # some functions need to be explicitly told what to do with NAs. No errors though!
#Avoid using just "T" as an abbreviation for "TRUE"
apply() on data with NAs?
Now, I am going to take the earlier counts table and add a few NAs. If I now try to calculate the mean number of counts, I will get NA as an answer for the rows that had NAs.
counts <- data.frame(Site1 = c(geneA = 2, geneB = 4, geneC = 12, geneD = 8),
Site2 = c(geneA = 15, geneB = NA, geneC = 27, geneD = 28),
Site3 = c(geneA = 10, geneB = 7, geneC = 13, geneD = NA))
counts
is.na() function to check your data
How do we find out ahead of time that we are missing data? Knowing is half the battle, and is.na() can help us determine this with some data structures.
With a vector we can easily see how some basic functions work.
# Let's check out this vector that contains NA values
na_vector <- c(5, 6, NA, 7, 7, NA)
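For example, on the vector just defined (anyNA() is a convenient base R shortcut for any(is.na(...))):

```r
na_vector <- c(5, 6, NA, 7, 7, NA)

is.na(na_vector)       # logical vector: TRUE where a value is NA
sum(is.na(na_vector))  # TRUE counts as 1, so this counts the NAs
anyNA(na_vector)       # TRUE if there is at least one NA
```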
which() function
Using is.na() we were returned a logical vector indicating whether or not each value was NA. There are several ways to apply this information through different functions, but a useful method for a vector of logicals is to ask which() positional indices are TRUE.
In our case, we use which() after checking for NA values in our object.
# Take a look at na_vector before you start manipulating it
na_vector
# wrap which() around our is.na() call
# indices where NAs are present in na_vector
na_values <-
# cut out the na_values indices
removed_na_vector_1 <- ; removed_na_vector_1
#equivalent to
removed_na_vector_2 <- na_vector[] ; removed_na_vector_2
# equivalentish to
(na_vector)
complete.cases() to query larger objects
With a large data frame, it may be hard to look at every cell to tell if there are NAs. You may encounter data sets which are missing values or have NA values due to experimental limitations. How do you easily exclude rows of invalid data from your analysis?
The function complete.cases() looks by row to see whether any row contains an NA. You can then subset out the rows with the NAs.
# Let's see what is.na() returns
counts
?is.na
# logical (TRUE or FALSE). Where in the data frames are the NAs if any?
# Let's combine it with the any() function
?any
# "Given a set of logical vectors, is at least one of the values true?" In our case, are there NAs at all in counts?
?complete.cases
# Outputs a logical vector specifying which observations/rows have no missing values across the entire sequence.
# Use it wisely to grab rows. Pop quiz [x,y] will it be x or y?
counts[,]
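Putting the pieces together, one way to keep only the complete rows (using the counts table with NAs from above, redefined here so the block stands alone):

```r
counts <- data.frame(Site1 = c(geneA = 2, geneB = 4, geneC = 12, geneD = 8),
                     Site2 = c(geneA = 15, geneB = NA, geneC = 27, geneD = 28),
                     Site3 = c(geneA = 10, geneB = 7, geneC = 13, geneD = NA))

complete.cases(counts)                    # TRUE for rows with no NAs
complete_counts <- counts[complete.cases(counts), ]  # keep only complete rows
complete_counts
```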
If you want to keep all of the observations in your data frame and do your calculations anyways, now that you are aware of what is going on in your data set, some functions specifically allow for this. Let's look up the documentation for the mean() function.
counts
... # What's the problem?
?mean
# If used as a stand-alone function, mean() takes a vector
mean(counts$)
mean(counts$,...)
# We can pass mean() to each column/row of a data frame using apply() (part of the apply family of functions)
# apply by row
apply(...)
# apply by column
apply(...)
# Apply the log function to non-NA observations. In this case na.omit can be useful.
#?na.omit
apply(counts, MARGIN = 1, ...)
We will focus more on the apply() family of functions later on in the course.
You can similarly deal with NaNs in R. NaNs (not a number) count as NAs (not available), but NAs are not NaNs. NaNs appear for undefined numeric results, such as 0/0 or sqrt(-1). Some packages may output NAs, NaNs, or Inf/-Inf (which can be found with is.finite()).
na_vector <- c(5, 6, NA, 7, 7, NA)
nan_vector <- c(5, 6, NaN, 7, 7, 0/0)
(na_vector)
(nan_vector)
(nan_vector)
(nan_vector)
# These types of operations are very useful when working with conditional statements (if else, while, etc.).
Basically, if you come across NaN's, you can use the same functions such as complete.cases() that you use with NAs.
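The asymmetry is easy to check for yourself:

```r
is.na(NaN)   # TRUE  - NaN is treated as missing
is.nan(NA)   # FALSE - a plain NA is not NaN
is.nan(0/0)  # TRUE  - undefined arithmetic produces NaN
```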
Depending on your purpose, you may replace NAs with a sample average, or the mode of the data, or a value that is below a threshold.
# Find the NA values and just replace them with an equivalent value that makes sense to your analysis
counts[...] <- 0
counts
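A common alternative, sketched here under the assumption that a column mean is a sensible fill value for your analysis, is to replace each NA with the mean of its column:

```r
counts <- data.frame(Site1 = c(geneA = 2, geneB = 4, geneC = 12, geneD = 8),
                     Site2 = c(geneA = 15, geneB = NA, geneC = 27, geneD = 28),
                     Site3 = c(geneA = 10, geneB = 7, geneC = 13, geneD = NA))

for (col in names(counts)) {
  col_mean <- mean(counts[[col]], na.rm = TRUE)    # mean of the non-NA values
  counts[[col]][is.na(counts[[col]])] <- col_mean  # fill NAs in that column
}
counts
```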
There are a few different places you can install packages from in R. Listed in order of increasing sketchiness:
CRAN (The Comprehensive R Archive Network)
Bioconductor (Bioinformatics/Genomics focus)
GitHub
Regardless of where you download a package from, it's a good idea to document the installation, especially if you had to troubleshoot it (you'll eventually be there, I promise!)
devtools is a package used by developers to build R packages, but it also helps us install packages from GitHub. It is downloaded from CRAN.
Installing packages through your JupyterHub notebook is relatively straightforward but any packages you install only live during your current instance of the hub. Whenever you logout from the JupyterHub, these installed libraries will essentially vaporize.
The install.packages() command will work just as it should in R and RStudio. Find instructions in the Appendix section for installation of packages into your own personal Anaconda-based installation of Jupyter Notebook.
# Always keep installation commands commented out
R may give you package installation warnings. Don't panic. In general, your package will either be installed and R will test if the installed package can be loaded, or R will give you a non-zero exit status - which means your package was not installed. If you read the entire error message, it will give you a hint as to why the package did not install.
Some packages depend on previously developed packages and can only be installed after those are in your library. Similarly, those packages may depend on other packages... The solution is to install the package together with all of the packages it relies on (its dependencies):
install.packages('devtools', ...)
remove.packages("devtools") # Uninstall any CRAN package
library() to load your packages after installation
A package only has to be installed once. It is now in your library. To use a package, you must load it into memory. Unless it is one of the packages R loads automatically, you choose which packages to load every session.
library() takes a single argument. We are going to write a function later in the course to load a character vector of packages. library() will throw an error if you try to load a package that is not installed. You may see require() on help pages; it also loads packages and is usually used inside functions, because it gives a warning instead of an error if a package is not installed. Errors stop code from running, while warnings allow code to keep running until an error is reached.
...
# or
#library()
myPackages <- c("package_1", "package_2", "package_3")
lapply(myPackages, library, character.only = TRUE)
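To see that difference in behaviour for yourself (the package name below is deliberately fake and should not exist in your library):

```r
# require() returns FALSE with a warning instead of stopping with an error,
# so the rest of the script keeps running
ok <- suppressWarnings(require("definitelyNotARealPackage", character.only = TRUE))
ok  # FALSE
```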
BiocManager
To install from Bioconductor, use the BiocManager package to pull down and install packages from the Bioconductor repository.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager") # this piece of code checks whether BiocManager is installed. If it is not installed, it installs it for you. It does nothing if BiocManager is already installed.
...::...("GenomicRanges")
#or
c("GenomicRanges", "ConnectivityMap")
package::function()
As mentioned before, devtools is required to install from GitHub. We don't actually need to load the entire devtools library if we are only going to use one function. We can select a single function using the syntax package::function().
("jennybc/googlesheets")
All packages are loaded the same regardless of their origin, using library().
...
Now, install and load the packages tidyverse and limma.
install.packages('tidyverse', ...)
#didn't have to have 'dependencies = TRUE', but may have run into some issues
BiocManager::install("limma")
library(...)
Note on data set formatting: R works best when each column is a single variable (length, size, pressure) and each row contains one and only one observation per variable. Thus, columns are called "variables" and rows are called "observations". We will go over proper formatting for data sets in lecture 3. Let's explore the iris data set:
() # what do the first few rows look like?
(iris) # what is the overall structure of this data set?
(iris) # what is the class of this data set?
head(...(iris)) # Are there any NA values?
...(is.na(iris)) # Remember what this does?
is.na(iris) # What do you think we'll get back?
Make sure that there are no other files named "iris_dataset" in your working directory, because they will be overwritten without warning (gone forever!! unless you have backups, which you should ALWAYS have for important files). Let's write iris as a comma-separated (csv) file to your working directory:
# Before writing, take a look at your working directory.
getwd()
list.files() # check existing files in your current directory.
?write.csv
write.csv(...,
file = ...,
row.names = ...,
quote = ...)
list.files(...)
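One possible way to fill in those blanks (the file name iris_dataset.csv is just a suggestion):

```r
write.csv(iris,
          file = "iris_dataset.csv",  # lands in your working directory
          row.names = FALSE,          # don't write row numbers as a column
          quote = FALSE)              # don't wrap strings in quotes

list.files(pattern = "iris")          # confirm the file now exists
```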
Write iris as a comma-separated (csv) file wherever you want on your computer without changing your working directory:
# Write to an absolute location on your directory
?write.csv
write.csv(iris,
file = ...,
row.names = FALSE,
quote = FALSE)
There are a number of methods or functions that we can use to read in files. Many of these functions are targeted for specific types of data formats but one you might use often is the read.csv() command. I personally use read.table() quite often as well.
getwd()
list.files()
?read.csv
iris_dataset <- read.csv(...,
check.names = ..., stringsAsFactors = ...,
header = ..., sep = ",", fill = ...)
(iris_dataset) # check the structure
(iris_dataset) # check the class
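A completed version of that call might look like this (a sketch; the block first writes iris_dataset.csv so that it stands alone):

```r
# Write iris out first so the example is self-contained
write.csv(iris, file = "iris_dataset.csv", row.names = FALSE, quote = FALSE)

iris_dataset <- read.csv("iris_dataset.csv",
                         check.names = TRUE,        # sanitize column names if needed
                         stringsAsFactors = FALSE,  # keep strings as characters
                         header = TRUE, sep = ",", fill = TRUE)

str(iris_dataset)    # check the structure
class(iris_dataset)  # check the class
```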
Next week we'll talk more about your data and quick ways to manipulate and tidy it up some more. For now let's quickly look at head() and tail()
# Let's investigate the file we just read in.
head(iris_dataset, ...)
tail(iris_dataset, ...)
By the end of this course, these commands will seem like second nature to you but we won't learn some of them until lecture 03 (plot() and boxplot()) and even lecture 06 (qqnorm() and qqline()).
Still, let's try some commands on our dataset to get a feel for your future!
#A brief on exploratory data analysis
#boxplot(iris_dataset)
iris_dataset$Species <- # We need to make "Species" a factor with its levels
...(iris_dataset)
...(iris_dataset$Petal.Width)
...(iris_dataset$Sepal.Length)
...(iris_dataset$Sepal.Length, col = 2) # Does iris' data come from a normally distributed population?
http://archive.ics.uci.edu/ml/datasets/Adult
https://github.com/patrickwalls/R-examples/blob/master/LinearAlgebraInR.Rmd
http://stat545.com/block002_hello-r-workspace-wd-project.html
http://stat545.com/block026_file-out-in.html
http://sisbid.github.io/Module1/
https://github.com/tidyverse/readxl
https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/
http://stat545.com/help-general.html
https://stackoverflow.com/help/how-to-ask
https://www.r-project.org/posting-guide.html
http://www.quantide.com/ramarro-chapter-07/
Complete Introduction to R assignment on DataCamp
If you'll be running your own installation of Jupyter Notebook or R/RStudio, please install the following packages from the conda-forge channel (or CRAN) for next time:
tidyverse (tidyverse installs several packages for you, like dplyr, readr, readxl, tibble, and ggplot2)
googlesheets
For this introductory course we will be teaching and running code for R through Jupyter notebooks. In this section we will discuss
As of 2021-01-18, The latest version of Anaconda3 runs with Python 3.8
Download the OS-appropriate version from here https://www.anaconda.com/products/individual
All versions should come with Python 3.8
Windows:
MacOS:
Unix:
As of 2020-12-11, the latest version of r-base available for Anaconda is 4.0.3, but Anaconda comes pre-installed with R 3.6.1. To save time, we will update just our r-base version through the command line using the Anaconda prompt. You'll need to find the menu shortcut to the prompt in order to run these commands. Before class you should update all of your Anaconda packages; this ensures you get the latest version of Jupyter Notebook. Open up the Anaconda prompt and type the following command:
conda update --all
It will ask permission to continue at some point. Say 'yes' to this. After this is completed, use the following command:
conda install -c conda-forge/label/main r-base=4.0.3=hddad469_3
Anaconda will try to install a number of R-related packages. Say 'yes' to this.
Lastly, we want to connect your R version to the Jupyter notebook itself. Type the following command:
conda install -c r r-irkernel
Jupyter should now have R integrated into it. No need to build an extra environment to run it.
You may find that for some reason or another, you'd like to maintain a specific R-environment (or other) to work in. Environments in Anaconda work like isolated sandbox versions of Anaconda within Anaconda. When you generate an environment for the first time, it will draw all of its packages and information from the base version of Anaconda - kind of like making a copy. You can also create these in the Anaconda prompt. You can even create new environments based on specific versions or installations of other programs. For instance, we could have tried to make an environment for R 4.0.3 with the command
conda create -n my_R_env -c conda-forge/label/main r-base=4.0.3=hddad469_3
This would create a new environment with version 4.0.3 of R but the base version of Anaconda would retain version 3.6.1 of R. A small but helpful detail if you are unsure about newer versions of packages that you'd like to use.
Likewise, you can update and install packages in new environments without affecting or altering your base environment! Again it's helpful if you're upgrading or installing new packages and programs. If you're not sure how it will affect what you already have in place, you can just install them straight into an environment.
For more information: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#cloning-an-environment
If you are inclined, the Anaconda Navigator can help you make an R environment separate from the base, but you won't be able to perform the same fancy tricks as in the prompt, like installing new packages directly to a new environment.
Note: You should consider doing this only if you have a good reason to isolate what you're doing in R from the Anaconda base packages. You will also need to have installed r-base 4.0.3 to make a new environment with it through the Anaconda navigator.
The Anaconda Navigator is a graphical interface that shows all of your pre-installed packages and gives you access to installing other common programs like RStudio (we'll get to that in a moment).
You will now have an R environment where you can install specific R packages that won't make their way into your Anaconda base.
You will likely find a shortcut to this environment in your (Windows) menu under the Anaconda folder. It will look something like Jupyter Notebook (R-4-0-3)
Normally I suggest avoiding installing packages through your Jupyter Notebook. Instead, if you want to update your R packages for running Jupyter, it's best to add them through either the Anaconda prompt or Anaconda navigator. Again, using the prompt gives you more options but can seem a little more complicated.
One of the most useful packages to install for R is r-essentials. Open up the Anaconda prompt and use the command:
conda install -c r r-essentials
After running, the Anaconda prompt will inform you of any package dependencies and identify which packages will be updated, newly installed, or removed (unlikely).
Anaconda has multiple channels (similar to repositories) that are maintained by different groups. These channels port regular R packages to a format that can be installed in Anaconda and run by R. The two main channels you'll find useful are the r channel and the conda-forge channel. You can find more information about all of the packages on docs.anaconda.com. As you might have guessed, the basic format for installing packages is: conda install -c channel-name r-package, where
conda install is the call to install packages. This can be done in a base or custom environment
-c channel-name identifies that you wish to name a specific channel to install from
r-package is the name of your package; most R package names will begin with r-, e.g. r-ggplot2
As of 2020-06-25, the latest stable R version is 4.0.3:
Windows:
- Go to <http://cran.utstat.utoronto.ca/>
- Click on 'Download R for Windows'
- Click on 'install R for the first time'
- Click on 'Download R 4.0.3 for Windows' (or a newer version)
- Double-click on the .exe file once it has downloaded and follow the instructions.
(Mac) OS X:
- Go to <http://cran.utstat.utoronto.ca/>
- Click on 'Download R for (Mac) OS X'
- Click on R-4.0.3.pkg (or a newer version)
- Open the .pkg file once it has downloaded and follow the instructions.
Linux:
- Open a terminal (Ctrl + alt + t)
- sudo apt-get update
- sudo apt-get install r-base
- sudo apt-get install r-base-dev (so you can compile packages from source)
As of 2021-01-18, the latest RStudio version is 1.4.1103
Windows:
- Go to <https://www.rstudio.com/products/rstudio/download/#download>
- Click on 'RStudio 1.4.1103 - Windows 10/8/7 (64-bit)' to download the installer (or a newer version)
- Double-click on the .exe file once it has downloaded and follow the instructions.
(Mac) OS X:
- Go to <https://www.rstudio.com/products/rstudio/download/#download>
- Click on 'RStudio 1.4.1103 - macOS 10.13+ (64-bit)' to download the installer (or a newer version)
- Double-click on the .dmg file once it has downloaded and follow the instructions.
Linux:
- Go to <https://www.rstudio.com/products/rstudio/download/#download>
- Click on the installer that describes your Linux distribution, e.g. 'RStudio 1.4.1103 - Ubuntu 18/Debian 10 (64-bit)' (or a newer version)
- Double-click on the .deb file once it has downloaded and follow the instructions.
- If double-clicking on your .deb file did not open the software manager, open the terminal (Ctrl + Alt + T) and type **sudo dpkg -i Downloads/rstudio-xenial-1.3.959-amd64.deb**
_Note: There are 3 things that could change in this last command._
1. This assumes you have just opened the terminal and are in your home directory. (If not, you have to modify your path. You can get to your home directory by typing cd ~.)
2. This assumes you have downloaded the .deb file to Downloads. (If you downloaded the file somewhere else, you have to change the path to the file, or download the .deb file to Downloads).
3. This assumes your file name for .deb is the same as above. (Put the name matching the .deb file you downloaded).
If you have a problem with installing R or RStudio, you can also try to solve the problem yourself by Googling any error messages you get. You can also try to get in touch with me or the course TAs.
RStudio is an IDE (Integrated Development Environment) for R that provides a more user-friendly experience than using R in a terminal setting. It has 4 main areas or panes, which you can customize to some extent under Tools > Global Options > Pane Layout:
All of the panes can be minimized or maximized using the large and small box outlines in the top right of each pane.
The Source is where you keep the code and annotation that you want saved as your script. The tab at the top left of the pane shows your script name (i.e. 'Untitled.R'), and you can switch between scripts by toggling the tabs. You can save, search or publish your source code using the buttons along the pane header. Code in the Source pane is not run or executed automatically; you have to send it to the Console.
To run your current line of code or a highlighted segment of code from the Source pane you can:
a) click the button 'Run' -> 'Run Selected Line(s)',
b) click 'Code' -> 'Run Selected Line(s)' from the menu bar,
c) use the keyboard shortcut CTRL + ENTER (Windows & Linux) or Command + ENTER (Mac) (recommended),
d) copy and paste your code into the Console and hit Enter (not recommended).
There are always many ways to do things in R, but the fastest way will always be the option that keeps your hands on the keyboard.
You can also type and execute your code (by hitting ENTER) in the Console when the > prompt is visible. If you enter code and you see a + instead of a prompt, R doesn't think you are finished entering code (i.e. you might be missing a bracket). If this isn't immediately fixable, you can hit Esc twice to get back to your prompt. Using the up and down arrow keys, you can find previous commands in the Console if you want to rerun code or fix an error resulting from a typo.
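For example, here is what an unbalanced expression looks like in a Console transcript (the > and + are R's prompts, not something you type):

```r
> sum(1, 2,   # Enter pressed before the closing bracket, so R shows a +
+ 3)          # finishing the expression here completes the command
[1] 6
```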
On the Console tab in the top left of that pane is your current working directory. Pressing the arrow next to your working directory will open your current folder in the Files pane. If you find your Console is getting too cluttered, selecting the broom icon in that pane will clear it for you. The Console also shows information: upon start up about R (such as version number), during the installation of packages, when there are warnings, and when there are errors.
In the Global Environment you can see all of the stored objects you have created or sourced (imported from another script). The Global Environment can become cluttered, so it also has a broom button to clear its workspace.
Objects are made by using the assignment operator <-. On the left side of the arrow, you have the name of your object. On the right side you have what you are assigning to that object. In this sense, you can think of an object as a container. The container holds the values given as well as information about 'class' and 'methods' (which we will come back to).
Type x <- c(2,4) in the Console followed by Enter. For 1D objects, the Environment pane immediately shows the data type and the first few values. Now type y <- data.frame(numbers = c(1,2,3), letters = c("a","b","c")) in the Console followed by Enter. For 2D objects you can immediately see the dimensions, and you can check the structure of data frames and lists (more later) by clicking on the arrow next to the object. Clicking on the object name will open the object to view in a new tab. Custom functions created in session or sourced will also appear in this pane.
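Run in the Console, the commands above plus a few common inspection functions look like this:

```r
x <- c(2, 4)                                # a 1D numeric vector
y <- data.frame(numbers = c(1, 2, 3),
                letters = c("a", "b", "c")) # a 2D data frame

class(x)    # "numeric"
length(x)   # 2
dim(y)      # 3 2  (rows, columns)
str(y)      # compact structure summary, like the Environment pane's arrow
```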
The Environment pane dropdown displays all of the currently loaded packages in addition to the Global Environment. Loaded means that all of the tools/functions in the package are available for use. R comes with a number of packages pre-loaded (e.g. base, grDevices).
In the History tab are all of the commands you have executed in the Console during your session. You can select a line of code and send it to the Source or Console.
The Connections tab is to connect to data sources such as Spark and will not be used in this lesson.
The Files tab allows you to search through directories; you can go to or set your working directory by making the appropriate selection under the More (blue gear) drop-down menu. The ... button at the end of the file path allows you to navigate to a folder in a more traditional manner.
The Plots tab is where plots you make in a .R script will appear (notebooks and markdown plots will be shown in the Source pane). There is the option to Export and save these plots manually.
The Packages tab has all of the packages that are installed and their versions, and buttons to Install or Update packages. A check mark in the box next to the package means that the package is loaded. You can load a package by adding a check mark next to a package, however it is good practice to instead load the package in your script to aid in reproducibility.
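In a script, that good practice looks like this (ggplot2 here is just a stand-in for whatever packages your script needs):

```r
# install.packages() only needs to be run once per machine/library,
# so it is usually left commented out:
# install.packages("ggplot2")

# library() must be run once per session; keeping these calls at the
# top of a script documents its dependencies for other users
library(ggplot2)
```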
The Help menu has the documentation for all packages and functions. For each function you will find a description of what the function does, the arguments it takes, what the function does to the inputs (details), what it outputs, and an example. Some of the help documentation is difficult to read or less than comprehensive, in which case googling the function is a good idea.
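The same documentation can also be pulled up straight from the Console:

```r
?mean                   # help page for a function you know by name
help("mean")            # the equivalent long form
??"standard deviation"  # search all installed help files for a phrase
help(package = "stats") # index of everything in a package
```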
The Viewer will display vignettes, or local web content such as a Shiny app, interactive graphs, or a rendered html document.
I suggest you take a look at Tools -> Global Options to customize your experience.
For example, under Code -> Editing I have selected Soft-wrap R source files followed by Apply so that my text will wrap by itself when I am typing and not create a long line of text.
You may also want to change the Appearance of your code. I like the RStudio theme: Modern and Editor font: Ubuntu Mono, but pick whatever you like! Again, you need to hit Apply to make changes.
That whirlwind tour isn't everything the IDE can do, but it is enough to get started.